Skip to content
This repository has been archived by the owner on Jan 22, 2025. It is now read-only.

ledger-tool: verify: add --record-slots and --verify-slots #34246

Merged

Conversation

seanyoung
Copy link
Contributor

@seanyoung seanyoung commented Nov 28, 2023

This adds:

    --record-slots <FILENAME>
        Write the slot hashes to this file, optionally with all the acounts

    --record-slots-config hash-only|bank
        Store the bank (=accounts) json file, or not.

    --verify-slots <FILENAME>
        Verify slot hashes against this file.

The first case can be used to dump a list of (slot, hash) to a json file
during a replay. The second case can be used to check slot hashes against
previously recorded values.

This is useful for debugging consensus failures, eg:

    # on good commit/branch
    ledger-tool verify --record-slots good.json --record-slots-config=bank

    # on bad commit or potentially consensus breaking branch
    ledger-tool verify --verify-slots good.json

On a hash mismatch an error will be logged with the expected hash vs the computed hash.

Problem

Summary of Changes

Fixes #

@seanyoung
Copy link
Contributor Author

This is #29189 with the following changes:

  • rebased
  • single command line option
  • tested

@seanyoung seanyoung marked this pull request as ready for review November 28, 2023 19:54
Copy link

codecov bot commented Nov 28, 2023

Codecov Report

Attention: Patch coverage is 66.66667% with 6 lines in your changes are missing coverage. Please review.

Project coverage is 81.7%. Comparing base (6ee3bb9) to head (e334f34).
Report is 18 commits behind head on master.

Additional details and impacted files
@@           Coverage Diff            @@
##           master   #34246    +/-   ##
========================================
  Coverage    81.7%    81.7%            
========================================
  Files         834      834            
  Lines      224299   224830   +531     
========================================
+ Hits       183361   183835   +474     
- Misses      40938    40995    +57     

@alessandrod
Copy link
Contributor

Thanks for picking this up! Since I did the original patch, this was added to the validator and ledger tool #32632. So it seems that this change is now kinda redundant? One can achieve the same by doing a first run with --write-bank-file, diff the bank files, then use --halt-at-slot to stop at the offending slot. Even if it's a couple extra steps, the bank file includes fields that likely make debugging the divergence faster.

@steviez
Copy link
Contributor

steviez commented Nov 29, 2023

One can achieve the same by doing a first run with --write-bank-file, diff the bank files, then use --halt-at-slot to stop at the offending slot. Even if it's a couple extra steps, the bank file includes fields that likely make debugging the divergence faster.

A few notes about the --write-bank-file flag:

  • It outputs the account of every touched account for a given slot to a human readable json file. As such, it has non-trivial file size (one from mnb from several months ago was 20 MB for one slot)
    • I only mention the file size as I'm not sure how many slots your typical "test" ledger is. Assuming 20 MB is representative (not sure that it is), that's 1 GB for every 50 slots. If you're running on the beefy dev server, then space probably not a concern. A laptop might be another
  • Currently, the flag only supports creating a file for the working bank (highest slot at end of replay). It could be an enhancement to allow creating these at every slot along the way
  • To your last point, absolutely, these files shave a bunch of time off the divergence debugging process. You'll be able to quickly tell which account(s) differed, which is often enough to isolate down to a single transaction that differed.

I uploaded one of these files here for the sake of seeing what exactly it contains:
212112050-GoZ4MYBB21YiiVCHWZHzKaMebtvk1sFWBqpzT3nZwk6s.json

@alessandrod
Copy link
Contributor

One can achieve the same by doing a first run with --write-bank-file, diff the bank files, then use --halt-at-slot to stop at the offending slot. Even if it's a couple extra steps, the bank file includes fields that likely make debugging the divergence faster.

A few notes about the --write-bank-file flag:

* It outputs the account of every touched account for a given slot to a human readable json file. As such, it has non-trivial file size (one from mnb from several months ago was 20 MB for one slot)

Oh I see, that's a lot larger than I thought. Maybe we can add an option to opt out of including the actual account data?

  * I only mention the file size as I'm not sure how many slots your typical "test" ledger is. Assuming 20 MB is representative (not sure that it is), that's 1 GB for every 50 slots. If you're running on the beefy dev server, then space probably not a concern. A laptop might be another
* Currently, the flag only supports creating a file for the working bank (highest slot at end of replay). It could be an enhancement to allow creating these at every slot along the way

Yeah for the purpose of smoke testing the runtime I usually replay very large ledgers, the more slots the better really. And data for each slot would have to be generated, preferably by changing the format to include multiple slots, dumping hundreds of files doesn't seem ideal.

* To your last point, absolutely, these files shave a bunch of time off the divergence debugging process. You'll be able to quickly tell which account(s) differed, which is often enough to isolate down to a single transaction that differed.

But so at the moment you must still separately figure out which slot diverges, then generate a bank file for it and diff right? If the format was changed to include multiple slots in order, then from the diff itself it would be possible to tell when the divergence happened I think right?

What I'm thinking is this: I've used the code in this PR many times and it helped me find many runtime bugs early, so obvs I would like to have the functionality in some form. The way I've used it was to find the offending broken slot, then add logs to strategic places to figure out diffs hashes etc... which --write-bank-file conveniently outputs to a file. So somehow merging these two features makes a lot of sense I think?

@steviez
Copy link
Contributor

steviez commented Nov 30, 2023

Oh I see, that's a lot larger than I thought. Maybe we can add an option to opt out of including the actual account data?

I'm onboard with this, and this idea would be inline with another idea I had to optionally include transaction information. The transaction information would further float size so not great for a total test bench, but potentially useful in automating some of the tedious work of honing in on the bug.

Yeah for the purpose of smoke testing the runtime I usually replay very large ledgers, the more slots the better really.

Yep, understood

And data for each slot would have to be generated, preferably by changing the format to include multiple slots, dumping hundreds of files doesn't seem ideal.

Yeah, I think that is reasonable as well. As you can see, there is some metadata that won't change (namely the version string but could add more down the road). But, we could add everything else into a slots array, and for the sake of solana-validator dumping a single slot when it has diverged, we can just output the array of size 1.

What I'm thinking is this: I've used the code in this PR many times and it helped me find many runtime bugs early, so obvs I would like to have the functionality in some form. The way I've used it was to find the offending broken slot, then add logs to strategic places to figure out diffs hashes etc... which --write-bank-file conveniently outputs to a file. So somehow merging these two features makes a lot of sense I think?

Got it. At a high level, I see two current cases for this feature:

  1. When a validator diverges, generate this file. The file can then also be generated with a known, good version and diffed. This is useful for us comparing version-to-version, as well as for multiple clients (the Firedancer team expressing desire for this feature was the final nudge to push this in)
  2. Kind of similar to above, but in your case, you're testing just runtime changes via solana-ledger-tool. For this case, the variable level of verbosity seems to fit:
    • You generate a file with just the components that go into bank hash as your "first level" for comparison to use as your known good set
    • You then develop some feature and want to test it; so you run and check against the known set
      • If the bank hashes match throughout replay, you're good to go
      • If not, then you should have the diverging slot. Then, you potentially go back and do --write-bank-file with the extra details (ie accounts) with the known-good-version AND your under-development-version, and diff those files to hone in.

From above, 1. is more me telling you how I use it; do you think 2. accurately captures how you would want to use this feature if we incorporate the "optional" details?

@seanyoung
Copy link
Contributor Author

I'm making changes to the runtime and testing with this feature. Hopefully by the time I'm done I'll have some bright ideas, or maybe not. We'll see

@steviez
Copy link
Contributor

steviez commented Dec 1, 2023

I'm making changes to the runtime and testing with this feature. Hopefully by the time I'm done I'll have some bright ideas, or maybe not. We'll see

Sounds good! I'll be out next week (Dec 4-8), but back the week after (Dec 11). Can review then, or if you still have other things you're working on (or other things that you would rather work on), I could also take this task off of your hands as well.

@alessandrod
Copy link
Contributor

Yeah, I think that is reasonable as well. As you can see, there is some metadata that won't change (namely the version string but could add more down the road). But, we could add everything else into a slots array, and for the sake of solana-validator dumping a single slot when it has diverged, we can just output the array of size 1.

Yep makes sense

What I'm thinking is this: I've used the code in this PR many times and it helped me find many runtime bugs early, so obvs I would like to have the functionality in some form. The way I've used it was to find the offending broken slot, then add logs to strategic places to figure out diffs hashes etc... which --write-bank-file conveniently outputs to a file. So somehow merging these two features makes a lot of sense I think?

Got it. At a high level, I see two current cases for this feature:

1. When a validator diverges, generate this file. The file can then also be generated with a known, good version and diffed. This is useful for us comparing version-to-version, as well as for multiple clients (the Firedancer team expressing desire for this feature was the final nudge to push this in)

2. Kind of similar to above, but in your case, you're testing just runtime changes via `solana-ledger-tool`. For this case, the variable level of verbosity seems to fit:
   
   * You generate a file with just the components that go into bank hash as your "first level" for comparison to use as your known good set
   * You then develop some feature and want to test it; so you run and check against the known set
     
     * If the bank hashes match throughout replay, you're good to go
     * If not, then you should have the diverging slot. Then, you potentially go back and do `--write-bank-file` with the extra details (ie accounts) with the known-good-version AND your under-development-version, and diff those files to hone in.

From above, 1. is more me telling you how I use it; do you think 2. accurately captures how you would want to use this feature if we incorporate the "optional" details?

Yep 2 is exactly what I'd want! :)

@seanyoung
Copy link
Contributor Author

Looks like we've settled on a design, it does look good. I'll implement it

@seanyoung seanyoung force-pushed the ledger-tool-check-slot-hashes branch 2 times, most recently from dd47f23 to efd77f3 Compare December 5, 2023 15:26
@seanyoung seanyoung force-pushed the ledger-tool-check-slot-hashes branch from efd77f3 to 53a74c4 Compare December 19, 2023 10:15
@seanyoung seanyoung changed the title ledger-tool: verify: add --verify-slot-hashes ledger-tool: verify: add --verify-slots and --verify-slots-details Dec 19, 2023
@seanyoung seanyoung force-pushed the ledger-tool-check-slot-hashes branch 2 times, most recently from 4e0b427 to 57ed4a9 Compare December 19, 2023 10:40
@seanyoung
Copy link
Contributor Author

Here is the re-worked version, that I've been using for debugging #34057. There is a new option --verify-slots-details which takes either none, tx, bank, or bank-tx. With bank-tx, for each slot the bank is included and also the transactions. The transactions are in batches and included the instructions/accounts/data and results, and also crucially a per-tx-batch bank hash, so you know exactly at which tx batch things go wrong.

Please let me know what you think.

@seanyoung seanyoung force-pushed the ledger-tool-check-slot-hashes branch 3 times, most recently from 41f85e8 to 678940b Compare December 19, 2023 11:49
@steviez
Copy link
Contributor

steviez commented Dec 19, 2023

Here is the re-worked version, that I've been using for debugging #34057. There is a new option --verify-slots-details which takes either none, tx, bank, or bank-tx. With bank-tx, for each slot the bank is included and also the transactions. The transactions are in batches and included the instructions/accounts/data and results, and also crucially a per-tx-batch bank hash, so you know exactly at which tx batch things go wrong.

Please let me know what you think.

Hi Sean, I'll take a look today. One request for the future - could you please include new changes in separate commits as opposed to combining with the existing commit ? Even if they end up getting combined on merge, it is nice to be able to flip through the commit-by-commit diffs (ie if I want to see what has changed with your latest update only)

Copy link
Contributor

@steviez steviez left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for looking/thinking about this! I'm glad you are also interested / see value in adding information for mid-slot (ie tx-by-tx); I've done something similar and have heard others express value in getting that. The general direction of this PR is good, maybe just some tweaks to how we specifically do several things.

That being said, I think we should take an incremental approach to keep PR's smaller

  • I have a PR that extends the existing BankHashDetails to allow multiple slots: Allow bank hash file to contain details for multiple slots #34516
  • With that, we could update the file to optionally include / exclude the accounts contents
    • The current BankHashDetails file includes them by default, but we probably want to exclude those details if we want to generate a file for 1000's of slots
  • Rebase this PR on top of the above two and enable ledger-tool to create / read those files
  • In a subsequent PR, add support for storing "intermediate" (ie tx-by-tx) details or other improvements

What do you think ?

ledger/src/blockstore_processor.rs Outdated Show resolved Hide resolved
ledger-tool/src/main.rs Outdated Show resolved Hide resolved
ledger-tool/src/main.rs Outdated Show resolved Hide resolved
@github-actions github-actions bot added stale [bot only] Added to stale content; results in auto-close after a week. and removed stale [bot only] Added to stale content; results in auto-close after a week. labels Jan 3, 2024
@seanyoung seanyoung force-pushed the ledger-tool-check-slot-hashes branch from 678940b to 8013ba6 Compare January 5, 2024 18:01
@seanyoung seanyoung force-pushed the ledger-tool-check-slot-hashes branch from 8013ba6 to edcbb88 Compare January 6, 2024 15:21
@seanyoung
Copy link
Contributor Author

Thanks for looking/thinking about this! I'm glad you are also interested / see value in adding information for mid-slot (ie tx-by-tx); I've done something similar and have heard others express value in getting that. The general direction of this PR is good, maybe just some tweaks to how we specifically do several things.

That being said, I think we should take an incremental approach to keep PR's smaller

* I have a PR that extends the existing `BankHashDetails` to allow multiple slots: [Allow bank hash file to contain details for multiple slots #34516](https://github.com/solana-labs/solana/pull/34516)

* With that, we could update the file to optionally include / exclude the accounts contents
  
  * The current `BankHashDetails` file includes them by default, but we probably want to exclude those details if we want to generate a file for 1000's of slots

* Rebase this PR on top of the above two and enable ledger-tool to create / read those files

* In a subsequent PR, add support for storing "intermediate" (ie tx-by-tx) details or other improvements

What do you think ?

Makes a lot of sense. Now this PR does not include transactions but does optionally include the bank, and writes the new BankHashDetails format. @steviez please let me know what you think.

@github-actions github-actions bot added the stale [bot only] Added to stale content; results in auto-close after a week. label Jan 22, 2024
@steviez steviez removed the stale [bot only] Added to stale content; results in auto-close after a week. label Jan 23, 2024
@github-actions github-actions bot added stale [bot only] Added to stale content; results in auto-close after a week. and removed stale [bot only] Added to stale content; results in auto-close after a week. labels Feb 6, 2024
@seanyoung seanyoung force-pushed the ledger-tool-check-slot-hashes branch 2 times, most recently from 258d0eb to 4aa9daa Compare February 13, 2024 00:13
ledger-tool/src/main.rs Outdated Show resolved Hide resolved
@seanyoung seanyoung force-pushed the ledger-tool-check-slot-hashes branch from 4aa9daa to 4e7f09d Compare February 24, 2024 10:13
ledger-tool/src/main.rs Outdated Show resolved Hide resolved
Comment on lines +1767 to +1797
// writing the json file ends up with a syscall for each number, comma, indentation etc.
// use BufWriter to speed things up

let writer = std::io::BufWriter::new(recorded_slots_file);

serde_json::to_writer_pretty(writer, &bank_hashes).unwrap();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is duplicate code with stuff we have in bank_hash_details.rs, would be nice to single source that

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fn write_bank_hash_details_file() write a single bank, and here we are writing a Vec<BankHashDetails>. Also fn write_bank_hash_details_file() is used from fn dump_then_repair_correct_slots() as well.

I am not sure how I can consolidate the two.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm yeah, maybe not as easy as I thought. I'd say we can leave it for now

The long term vision is implementing something like BankHashDetailsStreamer. This object could hold the serializer, and then dump out details bank-by-bank. Will definitely help out if someone runs the command with "accounts" mode on a large number of slots.

You could specify the config at the start, the object could retain the config. Then, when you call stream_bank() or whatever we decide to name it, the streamer extracts appropriate details (just the bank hash OR all the hashes + accounts).

To the end user, this is mostly invisible tho. So, like I said, I'm happy to push the PR as-is and then potentially go refactor after the fact

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, that makes sense. For virtual machine debugging it would be super-useful to have the transactions in the file too. Let start with this and go from there

ledger-tool/src/main.rs Outdated Show resolved Hide resolved
@seanyoung seanyoung force-pushed the ledger-tool-check-slot-hashes branch 7 times, most recently from 88a01d0 to 0f9eec7 Compare February 28, 2024 12:02
Comment on lines 1069 to 1071
// .default_value() does not work with .conflicts_with() in clap 2.33
// .conflicts_with("verify_slots")
// https://github.com/clap-rs/clap/issues/1605#issuecomment-722326915
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As you mentioned, would be nice if we were on recent CLAP so this bug didn't hinder us.

Instead, can we do the check ourselves instead. Ie, something like:

if arg_matches.is_present("record_slots") && arg_matches.is_present("verify_slots") {
    eprintln!("Some text, maybe mirror what the CLAP text would be if there weren't that bug");
}

What do you think ? Also, if we do this, we should do it fairly early in the "verify" match case, certainly before calling load_and_process_ledger_or_exit!()

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, good point. I've added it.

@seanyoung seanyoung force-pushed the ledger-tool-check-slot-hashes branch 2 times, most recently from e707b53 to 20a8add Compare February 29, 2024 09:38
This adds:

    --record-slots <FILENAME>
	Write the slot hashes to this file.

    --record-slots-config hash-only|bank
	Store the bank (=accounts) json file, or not.

    --verify-slots <FILENAME>
        Verify slot hashes against this file.

The first case can be used to dump a list of (slot, hash) to a json file
during a replay. The second case can be used to check slot hashes against
previously recorded values.

This is useful for debugging consensus failures, eg:

    # on good commit/branch
    ledger-tool verify --record-slots good.json --record-slots-config=bank

    # on bad commit or potentially consensus breaking branch
    ledger-tool verify --verify-slots good.json

On a hash mismatch an error will be logged with the expected hash vs the
computed hash.
@seanyoung seanyoung force-pushed the ledger-tool-check-slot-hashes branch from 20a8add to e334f34 Compare February 29, 2024 14:46
Copy link
Contributor

@steviez steviez left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool, let's push this (especially with repo changeover) and we can tweak things from there

@seanyoung seanyoung changed the title ledger-tool: verify: add --verify-slots and --verify-slots-details ledger-tool: verify: add --record-slots and --verify-slots Mar 1, 2024
@seanyoung seanyoung merged commit 9bb59aa into solana-labs:master Mar 1, 2024
35 checks passed
@seanyoung seanyoung deleted the ledger-tool-check-slot-hashes branch March 1, 2024 08:39
@seanyoung
Copy link
Contributor Author

Thanks @steviez

seanyoung added a commit to seanyoung/solana that referenced this pull request Jun 27, 2024
…bs#34246)

ledger-tool: verify: add --verify-slots and --verify-slots-details

This adds:

    --record-slots <FILENAME>
	Write the slot hashes to this file.

    --record-slots-config hash-only|accounts
	Store the bank (=accounts) json file, or not.

    --verify-slots <FILENAME>
        Verify slot hashes against this file.

The first case can be used to dump a list of (slot, hash) to a json file
during a replay. The second case can be used to check slot hashes against
previously recorded values.

This is useful for debugging consensus failures, eg:

    # on good commit/branch
    ledger-tool verify --record-slots good.json --record-slots-config=accounts

    # on bad commit or potentially consensus breaking branch
    ledger-tool verify --verify-slots good.json

On a hash mismatch an error will be logged with the expected hash vs the
computed hash.
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants